Missing data is part of any real world data analysis. It can crop up in unexpected places, making analyses challenging to understand. In this course, you will learn how to use tidyverse tools and the naniar R package to visualize missing values. You’ll tidy missing values so they can be used in analysis and explore missing values to find bias in the data. Lastly, you’ll reveal other underlying patterns of missingness. You will also learn how to “fill in the blanks” of missing values with imputation models, and how to visualize, assess, and make decisions based on these imputed datasets.
Chapter 1 introduces you to missing data: what missing values are, how they behave in R, and how to detect and count them. We then introduce missing data summaries, showing how to summarise missingness across cases and variables, and how to explore it across groups within the data. Finally, we discuss missing data visualizations: how to produce overview visualizations for the entire dataset, as well as for variables, cases, and other summaries, and how to explore these across groups.
# Chapter 1
Introduction to missing data Hi, my name is Nick Tierney. I’m a statistician and author of the naniar package, which makes it easy to work with missing data, and I’ll be your instructor for this course.
Introduction Statistician Gertrude Mary Cox once said: “The best thing to do with missing data is to not have any”. True as that is, it is not the world we live in. Working with real world data means working with missing data. To be a great analyst you need to know how to deal with missing values. Understanding how missing data works is important, as missing values can have unexpected effects on your analysis. For example, fitting a linear model to data with missing values silently deletes the incomplete rows, which means your decisions aren’t based on all the evidence. Replacing missing values, which is called imputation, has to be done very carefully - simply inserting the mean can lead to poor estimates and decisions.
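To see this row deletion in action, here is a minimal sketch using the built-in airquality dataset (by default, lm() drops incomplete rows via na.action = na.omit):

```r
# airquality has 153 rows, but Ozone and Solar.R contain NAs
fit <- lm(Ozone ~ Solar.R, data = airquality)

# lm() silently dropped the incomplete rows: only the 111
# complete cases were used to fit the model
nrow(airquality)  # 153
nobs(fit)         # 111
```

Nothing in the printed model output flags this unless you look for it, which is why it pays to check for missings before modeling.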
What will you learn In this course you will learn what missing values are, how to find missing data, how to wrangle and tidy missing data, why data is missing, and how to impute missing values.
Assumed knowledge For this course I will assume you have basic to intermediate experience with R, experience creating plots using ggplot2, experience using dplyr to manipulate data, and experience fitting linear models in R. In this first chapter we introduce missing values and how to check for and count them.
What are missing values? Before we get started, we need to define missing values. Missing values are values that should have been recorded, but were not. Think of it this way: you might accidentally not record seeing a bird - this is a missing value. It is different from recording that no birds were observed. R stores missing values as NA, which stands for not available.
How do I check if I have missing values? Missing values don’t jump out and scream “I’m here!”. They’re usually hidden, like a needle in a haystack. To detect missing values, use any_na(), which returns TRUE if there are any missings, and FALSE if there are none. are_na() asks “are these NA?” and returns TRUE/FALSE for each value. Here, are_na() shows us 3 TRUE values - 3 missing values. To avoid counting each TRUE yourself, n_miss() counts the number of missings, and prop_miss() gives the proportion of missings, which adds important context: 50% of the data is missing!
Working with missing data So what happens when we mix missing values with our calculations? We need to know, so we can be primed to find these cases. The general rule is this: calculations with NA return NA. Say you have the heights of three friends: Sophie, Dan, and Fred. If one height is missing, the sum of their heights returns NA - we don’t know the sum of a number and NA.
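A minimal sketch of this rule (the heights below are made up for illustration):

```r
heights <- c(sophie = 165, dan = 177, fred = NA)

sum(heights)               # NA - the sum of a number and NA is unknown
sum(heights, na.rm = TRUE) # 342 - NAs are removed before summing
```

Many summary functions (sum(), mean(), sd(), …) take an na.rm argument; using it is an explicit decision to ignore the missing values, which you should only make deliberately.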
Missing data gotchas There are some “gotchas” to be aware of when working with missing data. For example, NaN is “Not a Number”, and comes from operations like the square root of -1. R actually treats NaN as a missing value: is.na(NaN) returns TRUE. NULL is an empty value, and is not missing. This is subtly different from missing: an empty bucket isn’t missing water. Inf is an infinite value, resulting from operations like 10 divided by 0, and is also not missing.
Missing data gotchas (2) Finally, beware of conditional statements with missings. For example, NA or TRUE is TRUE, but NA or FALSE is NA. NA + NaN is NA, yet NaN + NA is NaN.
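These gotchas can be checked directly at the R console:

```r
0 / 0          # NaN - "Not a Number"
is.na(NaN)     # TRUE  - R treats NaN as missing
is.null(NULL)  # TRUE  - NULL is empty, not missing
is.na(Inf)     # FALSE - Inf is not missing
NA | TRUE      # TRUE  - the result is TRUE whatever NA turns out to be
NA | FALSE     # NA    - the result depends on the unknown value
```

The logic behind the conditional cases: `NA | TRUE` is TRUE because the result would be TRUE regardless of what the missing value is, while `NA | FALSE` stays NA because the answer hinges on the unknown value.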
Let’s practice! Let’s practice!
When working with missing data, there are a couple of commands that you should be familiar with - firstly, you should be able to identify if there are any missing values, and where these are.
Using the any_na() and are_na() tools, identify which values are missing.
# Create x, a vector, with values NA, NaN, Inf, ".", and "missing"
x <- c(NA, NaN,Inf, ".", "missing")
# Use any_na() and are_na() to explore the missings
any_na(x)
## [1] TRUE
are_na(x)
## [1] TRUE FALSE FALSE FALSE FALSE
One of the first things that you will want to check with a new dataset is whether there are any missing values, and how many there are.
You could use are_na() and count up the missing values, but the most efficient way to count missings is the n_miss() function. This will tell you the total number of missing values in the data.
You can then find the percentage of missing values in the data with the pct_miss() function.
You can also find the complement to these - how many complete values there are - using n_complete() and pct_complete().
# Using the example dataframe of heights and weights dat_hw
dat_hw <- read.table("~/edi_imp/datacamp_JB/dealing/dat_hw.txt", h=T, dec=".")
# Use n_miss() to count the total number of missing values in dat_hw
naniar::n_miss(dat_hw)
## [1] 30
# Use n_miss() on dat_hw$weight to count the total number of missing values
naniar::n_miss(dat_hw$weight)
## [1] 15
# Use n_complete() on dat_hw to count the total number of complete values
n_complete(dat_hw)
## [1] 170
# Use n_complete() on dat_hw$weight to count the total number of complete values
n_complete(dat_hw$weight)
## [1] 85
# Use prop_miss() and prop_complete() on dat_hw to find the proportion of missing and complete values
prop_miss(dat_hw)
## [1] 0.15
prop_complete(dat_hw)
## [1] 0.85
Now that you understand the behavior of missing values in R, and how to count them, let’s scale up our summaries for cases (rows) and variables, using miss_var_summary() and miss_case_summary(), and also explore how they can be applied for groups in a dataframe, using the group_by function from dplyr.
# Summarise missingness in each variable of the `airquality` dataset
miss_var_summary(airquality)
## # A tibble: 6 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 Ozone 37 24.2
## 2 Solar.R 7 4.58
## 3 Wind 0 0
## 4 Temp 0 0
## 5 Month 0 0
## 6 Day 0 0
# Summarise missingness in each case of the `airquality` dataset
miss_case_summary(airquality)
## # A tibble: 153 × 3
## case n_miss pct_miss
## <int> <int> <dbl>
## 1 5 2 33.3
## 2 27 2 33.3
## 3 6 1 16.7
## 4 10 1 16.7
## 5 11 1 16.7
## 6 25 1 16.7
## 7 26 1 16.7
## 8 32 1 16.7
## 9 33 1 16.7
## 10 34 1 16.7
## # … with 143 more rows
# Return the summary of missingness in each variable, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_var_summary()
## # A tibble: 25 × 4
## # Groups: Month [5]
## Month variable n_miss pct_miss
## <int> <chr> <int> <dbl>
## 1 5 Ozone 5 16.1
## 2 5 Solar.R 4 12.9
## 3 5 Wind 0 0
## 4 5 Temp 0 0
## 5 5 Day 0 0
## 6 6 Ozone 21 70
## 7 6 Solar.R 0 0
## 8 6 Wind 0 0
## 9 6 Temp 0 0
## 10 6 Day 0 0
## # … with 15 more rows
# Return the summary of missingness in each case, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_case_summary()
## # A tibble: 153 × 4
## # Groups: Month [5]
## Month case n_miss pct_miss
## <int> <int> <int> <dbl>
## 1 5 5 2 40
## 2 5 27 2 40
## 3 5 6 1 20
## 4 5 10 1 20
## 5 5 11 1 20
## 6 5 25 1 20
## 7 5 26 1 20
## 8 5 1 0 0
## 9 5 2 0 0
## 10 5 3 0 0
## # … with 143 more rows
The summaries of missingness we just calculated give us the number and percentage of missing observations for the cases and variables.
Another way to summarize missingness is by tabulating the number of times that there are 0, 1, 2, 3, and so on, missing values in a variable, or in a case.
In this exercise we are going to tabulate the number of missings in each case and variable using miss_var_table() and miss_case_table(), and also combine these summaries with the group_by() operator from dplyr to explore the summaries over a grouping variable in the dataset.
# Tabulate missingness in each variable and case of the `airquality` dataset
miss_case_table(airquality)
## # A tibble: 3 × 3
## n_miss_in_case n_cases pct_cases
## <int> <int> <dbl>
## 1 0 111 72.5
## 2 1 40 26.1
## 3 2 2 1.31
miss_var_table(airquality)
## # A tibble: 3 × 3
## n_miss_in_var n_vars pct_vars
## <int> <int> <dbl>
## 1 0 4 66.7
## 2 7 1 16.7
## 3 37 1 16.7
# Tabulate the missingness in each variable, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_var_table()
## # A tibble: 12 × 4
## # Groups: Month [5]
## Month n_miss_in_var n_vars pct_vars
## <int> <int> <int> <dbl>
## 1 5 0 3 60
## 2 5 4 1 20
## 3 5 5 1 20
## 4 6 0 4 80
## 5 6 21 1 20
## 6 7 0 4 80
## 7 7 5 1 20
## 8 8 0 3 60
## 9 8 3 1 20
## 10 8 5 1 20
## 11 9 0 4 80
## 12 9 1 1 20
# Tabulate the missingness in each case, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_case_table()
## # A tibble: 11 × 4
## # Groups: Month [5]
## Month n_miss_in_case n_cases pct_cases
## <int> <int> <int> <dbl>
## 1 5 0 24 77.4
## 2 5 1 5 16.1
## 3 5 2 2 6.45
## 4 6 0 9 30
## 5 6 1 21 70
## 6 7 0 26 83.9
## 7 7 1 5 16.1
## 8 8 0 23 74.2
## 9 8 1 8 25.8
## 10 9 0 29 96.7
## 11 9 1 1 3.33
Some summaries of missingness are particularly useful for different types of data. For example, miss_var_span() and miss_var_run().
miss_var_span() calculates the number of missing values in a specified variable for a repeating span. This is really useful in time series data, to look for weekly (7 day) patterns of missingness.
miss_var_run() calculates the number of “runs” or “streaks” of missingness. This is useful to find unusual patterns of missingness, for example, you might find a repeating pattern of 5 complete and 5 missings.
Both miss_var_span() and miss_var_run() work with the group_by operator from dplyr.
# Calculate the summaries for each run of missingness for the variable, hourly_counts
miss_var_run(pedestrian, var = hourly_counts)
## # A tibble: 35 × 2
## run_length is_na
## <int> <chr>
## 1 6628 complete
## 2 1 missing
## 3 5250 complete
## 4 624 missing
## 5 3652 complete
## 6 1 missing
## 7 1290 complete
## 8 744 missing
## 9 7420 complete
## 10 1 missing
## # … with 25 more rows
# Calculate the summaries for each span of missingness, for a span of 4000, for the variable hourly_counts
miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
## # A tibble: 10 × 6
## span_counter n_miss n_complete prop_miss prop_complete n_in_span
## <int> <int> <int> <dbl> <dbl> <int>
## 1 1 0 4000 0 1 4000
## 2 2 1 3999 0.00025 1.00 4000
## 3 3 121 3879 0.0302 0.970 4000
## 4 4 503 3497 0.126 0.874 4000
## 5 5 745 3255 0.186 0.814 4000
## 6 6 0 4000 0 1 4000
## 7 7 1 3999 0.00025 1.00 4000
## 8 8 0 4000 0 1 4000
## 9 9 745 3255 0.186 0.814 4000
## 10 10 432 1268 0.254 0.746 1700
# For each `month`, calculate the runs of missingness for hourly_counts
pedestrian %>% group_by(month) %>% miss_var_run(var = hourly_counts)
## # A tibble: 51 × 3
## # Groups: month [12]
## month run_length is_na
## <ord> <int> <chr>
## 1 January 2976 complete
## 2 February 2784 complete
## 3 March 2976 complete
## 4 April 888 complete
## 5 April 552 missing
## 6 April 1440 complete
## 7 May 744 complete
## 8 May 72 missing
## 9 May 2160 complete
## 10 June 2880 complete
## # … with 41 more rows
# For each `month`, calculate the spans of missingness, with a span of 2000, for hourly_counts
pedestrian %>% group_by(month) %>% miss_var_span(var = hourly_counts, span_every = 2000)
## # A tibble: 25 × 7
## # Groups: month [12]
## month span_counter n_miss n_complete prop_miss prop_complete n_in_span
## <ord> <int> <int> <int> <dbl> <dbl> <int>
## 1 January 1 0 2000 0 1 2000
## 2 January 2 0 976 0 1 976
## 3 February 1 0 2000 0 1 2000
## 4 February 2 0 784 0 1 784
## 5 March 1 0 2000 0 1 2000
## 6 March 2 0 976 0 1 976
## 7 April 1 552 1448 0.276 0.724 2000
## 8 April 2 0 880 0 1 880
## 9 May 1 72 1928 0.036 0.964 2000
## 10 May 2 0 976 0 1 976
## # … with 15 more rows
It can be difficult to get a handle on where the missing values are in your data, and this is where visualization can really help.
The function vis_miss() creates an overview visualization of the missingness in the data. It also has options to cluster rows based on missingness, using cluster = TRUE; as well as options for sorting the columns, from most missing to least missing (sort_miss = TRUE).
# Visualize all of the missingness in the `riskfactors` dataset
vis_miss(riskfactors)
# Visualize and cluster all of the missingness in the `riskfactors` dataset
vis_miss(riskfactors, cluster = TRUE)
# Visualize and sort the columns by missingness in the `riskfactors` dataset
vis_miss(riskfactors, sort_miss = TRUE)
To get a clear picture of the missingness across variables and cases, use gg_miss_var() and gg_miss_case(). These are the visual counterpart to miss_var_summary() and miss_case_summary().
These can be split up into multiple plots with one for each category by choosing a variable to facet by.
# Visualize the number of missings in cases using `gg_miss_case()`
gg_miss_case(riskfactors)
# Explore the number of missings in cases using `gg_miss_case()` and facet by the variable `education`
gg_miss_case(riskfactors, facet = education)
# Visualize the number of missings in variables using `gg_miss_var()`
gg_miss_var(riskfactors)
# Explore the number of missings in variables using `gg_miss_var()` and facet by the variable `education`
gg_miss_var(riskfactors, facet = education)
Let’s practice a few different ways to visualize patterns of missingness using:
gg_miss_upset() to give an overall pattern of missingness.
gg_miss_fct() for a dataset that has a factor of interest: marital.
gg_miss_span() to explore the missingness in a time series dataset.
What do you notice about the missingness and the faceting in the data?
# Using the airquality dataset, explore the missingness pattern using gg_miss_upset()
gg_miss_upset(airquality)
# With the riskfactors dataset, explore how the missingness changes across the marital variable using gg_miss_fct()
gg_miss_fct(x = riskfactors, fct = marital)
# Using the pedestrian dataset, explore how the missingness of hourly_counts changes over a span of 3000
gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)
# Using the pedestrian dataset, explore the impact of month by facetting by month
# and explore how missingness changes for a span of 1000
gg_miss_span(pedestrian, var = hourly_counts, span_every = 1000, facet = month)
In chapter two, you will learn how to uncover hidden missing values like “missing” or “N/A” and replace them with NA. You will learn how to efficiently handle implicit missing values - those values implied to be missing, but not explicitly listed. We also cover how to explore missing data dependence, discussing Missing Completely at Random (MCAR), Missing At Random (MAR), Missing Not At Random (MNAR), and what they mean for your data analysis.
You have a dataset with missing values coded as "N/A", "missing", and "na". But before we go ahead and start replacing these with NA, we should get an idea of how big the problem is.
Use miss_scan_count() to count the possible missings in the dataset, pacman, a dataset of Pac-Man scores containing six columns:
year, month, day: the date that person made that score. initial: the initials of the person. score: the score of that person. country: the country.
# Using the example dataframe
pacman<-readRDS("~/edi_imp/datacamp_JB/dealing/pacman.rds")
# Explore the strange missing values "N/A"
miss_scan_count(data = pacman, search = list("N/A"))
## # A tibble: 6 × 2
## Variable n
## <chr> <int>
## 1 year 0
## 2 month 0
## 3 day 0
## 4 initial 0
## 5 score 100
## 6 country 0
# Explore the strange missing values "missing"
miss_scan_count(data = pacman, search = list("missing"))
## # A tibble: 6 × 2
## Variable n
## <chr> <int>
## 1 year 93
## 2 month 93
## 3 day 93
## 4 initial 0
## 5 score 0
## 6 country 0
# Explore the strange missing values "na"
miss_scan_count(data = pacman, search = list("na"))
## # A tibble: 6 × 2
## Variable n
## <chr> <int>
## 1 year 100
## 2 month 100
## 3 day 100
## 4 initial 0
## 5 score 0
## 6 country 0
# Explore the strange missing values " " (a single space)
miss_scan_count(data = pacman, search = list(" "))
## # A tibble: 6 × 2
## Variable n
## <chr> <int>
## 1 year 0
## 2 month 0
## 3 day 0
## 4 initial 0
## 5 score 0
## 6 country 100
# Explore all of the strange missing values, "N/A", "missing", "na", " "
miss_scan_count(data = pacman, search = list("N/A", "missing","na", " "))
## # A tibble: 6 × 2
## Variable n
## <chr> <int>
## 1 year 193
## 2 month 193
## 3 day 193
## 4 initial 0
## 5 score 100
## 6 country 100
Following on from the previous dataset, we now know that we have a few strange missing values.
Now, we are going to do something about it, and replace these values with missing values (NA) using the function replace_with_na().
# Print the top of the pacman data using `head()`
head(pacman)
## # A tibble: 6 × 6
## year month day initial score country
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2007 10 27 LEX 2065812 "CA"
## 2 1995 8 23 PNY 1163465 "JP"
## 3 1980 2 8 MBJ 175380 " "
## 4 1982 5 9 QRC 2025632 "ES"
## 5 na na na YPZ 925357 "NZ"
## 6 2013 11 15 RVJ 319733 "AU"
# Replace the strange missing values "N/A", "na", and
# "missing" with `NA` for the variables, year, and score
pacman_clean <- replace_with_na(pacman, replace = list(year = c("N/A", "na", "missing"),
score = c("N/A", "na", "missing")))
# Check whether `pacman_clean` still has these values in it
miss_scan_count(pacman_clean, search = list("N/A", "na", "missing"))
## # A tibble: 6 × 2
## Variable n
## <chr> <int>
## 1 year 0
## 2 month 193
## 3 day 193
## 4 initial 0
## 5 score 0
## 6 country 0
To reduce code repetition when replacing values with NA, use the “scoped variants” of replace_with_na():
replace_with_na_at()
replace_with_na_if()
replace_with_na_all()
The syntax of replacement looks like this:
~.x == "N/A" replaces all cases that are equal to "N/A".
~.x %in% c("N/A", "missing", "na", " ") replaces all cases that match "N/A", "missing", "na", or " ".
# Use `replace_with_na_at()` to replace with NA
replace_with_na_at(pacman,
.vars = c("year", "month", "day"),
~.x %in% c("N/A", "missing", "na", " "))
## # A tibble: 2,000 × 6
## year month day initial score country
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2007 10 27 LEX 2065812 "CA"
## 2 1995 8 23 PNY 1163465 "JP"
## 3 1980 2 8 MBJ 175380 " "
## 4 1982 5 9 QRC 2025632 "ES"
## 5 <NA> <NA> <NA> YPZ 925357 "NZ"
## 6 2013 11 15 RVJ 319733 "AU"
## 7 2003 12 4 VKD 3322668 "US"
## 8 2016 9 9 ZIS 2137806 "CN"
## 9 2013 3 20 IYD 3059716 "CN"
## 10 1993 5 19 CHQ 231892 "AU"
## # … with 1,990 more rows
# Use `replace_with_na_if()` to replace with NA the character values using `is.character`
replace_with_na_if(pacman,
.predicate = is.character,
~.x %in% c("N/A", "missing", "na", " "))
## # A tibble: 2,000 × 6
## year month day initial score country
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2007 10 27 LEX 2065812 CA
## 2 1995 8 23 PNY 1163465 JP
## 3 1980 2 8 MBJ 175380 <NA>
## 4 1982 5 9 QRC 2025632 ES
## 5 <NA> <NA> <NA> YPZ 925357 NZ
## 6 2013 11 15 RVJ 319733 AU
## 7 2003 12 4 VKD 3322668 US
## 8 2016 9 9 ZIS 2137806 CN
## 9 2013 3 20 IYD 3059716 CN
## 10 1993 5 19 CHQ 231892 AU
## # … with 1,990 more rows
# Use `replace_with_na_all()` to replace with NA
replace_with_na_all(pacman, ~.x %in% c("N/A", "missing", "na", " "))
## # A tibble: 2,000 × 6
## year month day initial score country
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2007 10 27 LEX 2065812 CA
## 2 1995 8 23 PNY 1163465 JP
## 3 1980 2 8 MBJ 175380 <NA>
## 4 1982 5 9 QRC 2025632 ES
## 5 <NA> <NA> <NA> YPZ 925357 NZ
## 6 2013 11 15 RVJ 319733 AU
## 7 2003 12 4 VKD 3322668 US
## 8 2016 9 9 ZIS 2137806 CN
## 9 2013 3 20 IYD 3059716 CN
## 10 1993 5 19 CHQ 231892 AU
## # … with 1,990 more rows
We are going to explore a new dataset, frogger.
This dataset contains 4 scores per player recorded at different times: morning, afternoon, evening, and late_night.
Every player should have played 4 games, one at each of these times, but it looks like not every player completed all of these games.
Use the complete() function to make these implicit missing values explicit.
frogger <- read_excel("~/edi_imp/datacamp_JB/dealing/frogger.xlsx")
# Print the frogger data to have a look at it
frogger
## # A tibble: 15 × 3
## name time value
## <chr> <chr> <chr>
## 1 jesse morning 6678
## 2 jesse afternoon 800060
## 3 jesse evening 475528
## 4 jesse late_night 143533
## 5 andy morning 425115
## 6 andy afternoon 587468
## 7 andy late_night 111000
## 8 nic afternoon 588532
## 9 nic late_night 915533
## 10 dan morning 388148
## 11 dan evening 180912
## 12 alex morning 552670
## 13 alex afternoon 98355
## 14 alex evening 266055
## 15 alex late_night 121056
# Use `complete()` on the `time` and `name` variables to
# make implicit missing values explicit
frogger_tidy <- frogger %>% complete(time, name)
One common type of missing value is where the first entry of a group is recorded, but subsequent entries are marked NA.
These missing values often come from spreadsheets where a value is entered once and the cells below are left blank, to avoid typing the same name multiple times and to aid human readability.
This type of problem can be solved by using the fill() function from the tidyr package.
# Print the frogger data to have a look at it
frogger
## # A tibble: 15 × 3
## name time value
## <chr> <chr> <chr>
## 1 jesse morning 6678
## 2 jesse afternoon 800060
## 3 jesse evening 475528
## 4 jesse late_night 143533
## 5 andy morning 425115
## 6 andy afternoon 587468
## 7 andy late_night 111000
## 8 nic afternoon 588532
## 9 nic late_night 915533
## 10 dan morning 388148
## 11 dan evening 180912
## 12 alex morning 552670
## 13 alex afternoon 98355
## 14 alex evening 266055
## 15 alex late_night 121056
# Use `fill()` to fill down the name variable in the frogger dataset
frogger %>% fill(name)
## # A tibble: 15 × 3
## name time value
## <chr> <chr> <chr>
## 1 jesse morning 6678
## 2 jesse afternoon 800060
## 3 jesse evening 475528
## 4 jesse late_night 143533
## 5 andy morning 425115
## 6 andy afternoon 587468
## 7 andy late_night 111000
## 8 nic afternoon 588532
## 9 nic late_night 915533
## 10 dan morning 388148
## 11 dan evening 180912
## 12 alex morning 552670
## 13 alex afternoon 98355
## 14 alex evening 266055
## 15 alex late_night 121056
Now let’s put it together!
Use complete() and fill() together to fix both explicitly and implicitly missing values in the frogger dataset.
# Print the frogger data to have a look at it
frogger
## # A tibble: 15 × 3
## name time value
## <chr> <chr> <chr>
## 1 jesse morning 6678
## 2 jesse afternoon 800060
## 3 jesse evening 475528
## 4 jesse late_night 143533
## 5 andy morning 425115
## 6 andy afternoon 587468
## 7 andy late_night 111000
## 8 nic afternoon 588532
## 9 nic late_night 915533
## 10 dan morning 388148
## 11 dan evening 180912
## 12 alex morning 552670
## 13 alex afternoon 98355
## 14 alex evening 266055
## 15 alex late_night 121056
# Correctly fill() and complete() missing values so that our dataset becomes sensible
frogger %>%
fill(name) %>%
complete(name, time)
## # A tibble: 20 × 3
## name time value
## <chr> <chr> <chr>
## 1 alex afternoon 98355
## 2 alex evening 266055
## 3 alex late_night 121056
## 4 alex morning 552670
## 5 andy afternoon 587468
## 6 andy evening <NA>
## 7 andy late_night 111000
## 8 andy morning 425115
## 9 dan afternoon <NA>
## 10 dan evening 180912
## 11 dan late_night <NA>
## 12 dan morning 388148
## 13 jesse afternoon 800060
## 14 jesse evening 475528
## 15 jesse late_night 143533
## 16 jesse morning 6678
## 17 nic afternoon 588532
## 18 nic evening <NA>
## 19 nic late_night 915533
## 20 nic morning <NA>
We need to make certain assumptions about our data when we move further into analysis.
Which of the following answers about MCAR and MAR is TRUE?
Generally, deleting observations is safer for MCAR data than deleting observations for MAR data.
MCAR means missingness is related to data observed, whereas for MAR, missingness is related to data unobserved.
MAR and MCAR are effectively the same, the distinction is not important.
MCAR data is unrelated to data observed and unobserved. MAR data is related to data observed.
To learn about the structure of the missingness in data, you can explore how sorting changes how missingness is presented.
For the oceanbuoys dataset, explore the missingness with vis_miss(), and then arrange by a few different variables.
This is not a definitive process, but it will get you started to ask the right questions of your data. We explore more powerful techniques in the next chapter.
# Arrange by year
oceanbuoys %>% arrange(year) %>% vis_miss()
# Arrange by latitude
oceanbuoys %>% arrange(latitude) %>% vis_miss()
# Arrange by wind_ew (wind east west)
oceanbuoys %>% arrange(wind_ew) %>% vis_miss()
Using the information from earlier on the oceanbuoys dataset, which of these statements most appropriately describes the missingness type?
Try using gg_miss_var() and gg_miss_case(), faceting by year, to get more information. For example:
library(naniar)
gg_miss_var(oceanbuoys, facet = year)
Possible Answers
Data is MCAR: wind direction is important for explaining missingness
Data is MCAR: year is important for explaining missingness
Data is MAR: location is not important for explaining missingness
Data is MAR: year and location are both important for explaining missingness
Missing data can be tricky to think about, as they don’t usually proclaim themselves for you, and instead hide amongst the weeds of the data.
One way to help expose missing values is to change the way we think about the data - by thinking about every single data value being missing or not missing.
The as_shadow() function in R transforms a dataframe into a shadow matrix, a special data format where the values are either missing (NA) or not missing (!NA).
The column names of a shadow matrix are the same as the data’s, but with the suffix _NA added.
To keep track of and compare data values to their missingness state, use the bind_shadow() function. Having data in this format, with the shadow matrix column bound to the regular data is called nabular data.
# Create shadow matrix data with `as_shadow()`
as_shadow(oceanbuoys)
## # A tibble: 736 × 8
## year_NA latitude_NA longitude_NA sea_temp_c…¹ air_t…² humid…³ wind_…⁴ wind_…⁵
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 !NA !NA !NA !NA !NA !NA !NA !NA
## 2 !NA !NA !NA !NA !NA !NA !NA !NA
## 3 !NA !NA !NA !NA !NA !NA !NA !NA
## 4 !NA !NA !NA !NA !NA !NA !NA !NA
## 5 !NA !NA !NA !NA !NA !NA !NA !NA
## 6 !NA !NA !NA !NA !NA !NA !NA !NA
## 7 !NA !NA !NA !NA !NA !NA !NA !NA
## 8 !NA !NA !NA !NA !NA !NA !NA !NA
## 9 !NA !NA !NA !NA !NA !NA !NA !NA
## 10 !NA !NA !NA !NA !NA !NA !NA !NA
## # … with 726 more rows, and abbreviated variable names ¹sea_temp_c_NA,
## # ²air_temp_c_NA, ³humidity_NA, ⁴wind_ew_NA, ⁵wind_ns_NA
# Create nabular data by binding the shadow to the data with `bind_shadow()`
bind_shadow(oceanbuoys)
## # A tibble: 736 × 16
## year latit…¹ longi…² sea_t…³ air_t…⁴ humid…⁵ wind_ew wind_ns year_NA latit…⁶
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40 !NA !NA
## 2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30 !NA !NA
## 3 1997 0 -110 27.6 27 76.5 -5.10 4.5 !NA !NA
## 4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5 !NA !NA
## 5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10 !NA !NA
## 6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60 !NA !NA
## 7 1997 0 -110 28.0 27.0 76.5 -2 3.5 !NA !NA
## 8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5 !NA !NA
## 9 1997 0 -110 28.0 27.2 78.6 -4.20 5 !NA !NA
## 10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5 !NA !NA
## # … with 726 more rows, 6 more variables: longitude_NA <fct>,
## # sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>,
## # wind_ew_NA <fct>, wind_ns_NA <fct>, and abbreviated variable names
## # ¹latitude, ²longitude, ³sea_temp_c, ⁴air_temp_c, ⁵humidity, ⁶latitude_NA
# Bind only the variables with missing values by using bind_shadow(only_miss = TRUE)
bind_shadow(oceanbuoys, only_miss = TRUE)
## # A tibble: 736 × 11
## year latit…¹ longi…² sea_t…³ air_t…⁴ humid…⁵ wind_ew wind_ns sea_t…⁶ air_t…⁷
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40 !NA !NA
## 2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30 !NA !NA
## 3 1997 0 -110 27.6 27 76.5 -5.10 4.5 !NA !NA
## 4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5 !NA !NA
## 5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10 !NA !NA
## 6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60 !NA !NA
## 7 1997 0 -110 28.0 27.0 76.5 -2 3.5 !NA !NA
## 8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5 !NA !NA
## 9 1997 0 -110 28.0 27.2 78.6 -4.20 5 !NA !NA
## 10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5 !NA !NA
## # … with 726 more rows, 1 more variable: humidity_NA <fct>, and abbreviated
## # variable names ¹latitude, ²longitude, ³sea_temp_c, ⁴air_temp_c, ⁵humidity,
## # ⁶sea_temp_c_NA, ⁷air_temp_c_NA
Now that you can create nabular data, let’s use it to explore the data by calculating summary statistics based on the missingness of another variable.
To do this we are going to use the following steps:
First, bind_shadow() turns the data into nabular data.
Next, perform some summaries on the data using group_by() and summarize() to calculate the mean and standard deviation, using the mean() and sd() functions.
# `bind_shadow()` and `group_by()` humidity missingness (`humidity_NA`)
oceanbuoys %>%
bind_shadow() %>%
group_by(humidity_NA) %>%
summarize(wind_ew_mean = mean(wind_ew), # calculate mean of wind_ew
wind_ew_sd = sd(wind_ew)) # calculate standard deviation of wind_ew
## # A tibble: 2 × 3
## humidity_NA wind_ew_mean wind_ew_sd
## <fct> <dbl> <dbl>
## 1 !NA -3.78 1.90
## 2 NA -3.30 2.31
# Repeat this, but calculating summaries for wind north south (`wind_ns`)
oceanbuoys %>%
bind_shadow() %>%
group_by(humidity_NA) %>%
summarize(wind_ns_mean = mean(wind_ns),
wind_ns_sd = sd(wind_ns))
## # A tibble: 2 × 3
## humidity_NA wind_ns_mean wind_ns_sd
## <fct> <dbl> <dbl>
## 1 !NA 2.78 2.06
## 2 NA 1.66 2.23
It can be useful to get a bit of extra information about the number of cases in each missing condition.
In this exercise, we are going to add information about the number of observed cases using n() inside the summarize() function.
We will then add an additional level of grouping by looking at the combination of humidity being missing (humidity_NA) and air temperature being missing (air_temp_c_NA).
# Summarize wind_ew by the missingness of `air_temp_c_NA`
oceanbuoys %>%
bind_shadow() %>%
group_by(air_temp_c_NA) %>%
summarize(wind_ew_mean = mean(wind_ew),
wind_ew_sd = sd(wind_ew),
n_obs = n())
## # A tibble: 2 × 4
## air_temp_c_NA wind_ew_mean wind_ew_sd n_obs
## <fct> <dbl> <dbl> <int>
## 1 !NA -3.91 1.85 655
## 2 NA -2.17 2.14 81
# Summarize wind_ew by missingness of `air_temp_c_NA` and `humidity_NA`
oceanbuoys %>%
bind_shadow() %>%
group_by(air_temp_c_NA, humidity_NA) %>%
summarize(wind_ew_mean = mean(wind_ew),
wind_ew_sd = sd(wind_ew),
n_obs = n())
## `summarise()` has grouped output by 'air_temp_c_NA'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 5
## # Groups: air_temp_c_NA [2]
## air_temp_c_NA humidity_NA wind_ew_mean wind_ew_sd n_obs
## <fct> <fct> <dbl> <dbl> <int>
## 1 !NA !NA -4.01 1.74 565
## 2 !NA NA -3.24 2.31 90
## 3 NA !NA -2.06 2.08 78
## 4 NA NA -4.97 1.74 3
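The message above appears because `summarize()` leaves the result grouped by the first grouping variable. If you prefer an ungrouped result (and no message), you can drop the grouping explicitly with the `.groups` argument; a minimal sketch (the name `wind_summary` is just for illustration):

```r
library(dplyr)
library(naniar)

# Same summary as above, but returning an ungrouped tibble
wind_summary <- oceanbuoys %>%
  bind_shadow() %>%
  group_by(air_temp_c_NA, humidity_NA) %>%
  summarize(wind_ew_mean = mean(wind_ew),
            wind_ew_sd = sd(wind_ew),
            n_obs = n(),
            .groups = "drop")  # drop all grouping from the result
wind_summary
```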
Summary statistics are useful to calculate, but as they say, a picture is worth a thousand words.
In this exercise, we are going to use nabular data to explore the variation in a variable by the missingness of another.
We are going to use the oceanbuoys dataset from naniar.
# First explore the missingness structure of `oceanbuoys` using `vis_miss()`
vis_miss(oceanbuoys)
# Explore the distribution of `wind_ew` for the missingness
# of `air_temp_c_NA` using `geom_density()`
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = wind_ew,
color = air_temp_c_NA)) +
geom_density()
# Explore the distribution of sea temperature for the
# missingness of humidity (humidity_NA) using `geom_density()`
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = sea_temp_c,
color = humidity_NA)) +
geom_density()
## Warning: Removed 3 rows containing non-finite values (stat_density).
You can now use nabular data to visualize and explore missing data using density plots.
In this exercise, we are going to continue using nabular data to explore the variation in a variable by the missingness of another.
We are going to use the oceanbuoys dataset from naniar, and then create multiple plots of the data using facets.
This allows you to explore different layers of missingness.
# Explore the distribution of wind east west (wind_ew) for the missingness of air temperature
# using geom_density() and faceting by the missingness of air temperature (air_temp_c_NA).
oceanbuoys %>%
bind_shadow() %>%
ggplot(aes(x = wind_ew)) +
geom_density() +
facet_wrap(~air_temp_c_NA)
# Build upon this visualization by filling by the missingness of humidity (humidity_NA).
oceanbuoys %>%
bind_shadow() %>%
ggplot(aes(x = wind_ew,
color = humidity_NA)) +
geom_density() +
facet_wrap(~air_temp_c_NA)
You can now use nabular data to visualize and explore missing data.
Previous exercises used nabular data along with density plots to explore the variation in a variable by the missingness of another.
We are going to use the oceanbuoys dataset from naniar again, this time using box plots instead of density plots to explore different layers of missingness.
# Explore the distribution of wind east west (`wind_ew`) for
# the missingness of air temperature using `geom_boxplot()`
oceanbuoys %>%
bind_shadow() %>%
ggplot(aes(x = air_temp_c_NA,
y = wind_ew)) +
geom_boxplot()
# Build upon this visualization by faceting by the missingness of humidity (`humidity_NA`).
oceanbuoys %>%
bind_shadow() %>%
ggplot(aes(x = air_temp_c_NA,
y = wind_ew)) +
geom_boxplot() +
facet_wrap(~humidity_NA)
You can now use nabular data to visualize and explore missing data with box plots and facet wraps.
Missing values in a scatter plot in ggplot2 are removed by default, with a warning.
We can display missing values in a scatter plot, using geom_miss_point() - a special ggplot2 geom that shifts the missing values into the plot, displaying them 10% below the minimum of the variable.
Let’s practice using this visualization with the oceanbuoys dataset.
# Explore the missingness in wind and air temperature, and
# display the missingness using `geom_miss_point()`
ggplot(oceanbuoys,
aes(x = wind_ew,
y = air_temp_c)) +
geom_miss_point()
# Explore the missingness in humidity and air temperature,
# and display the missingness using `geom_miss_point()`
ggplot(oceanbuoys,
aes(x = humidity,
y = air_temp_c)) +
geom_miss_point()
### Using facets to explore missingness
Because geom_miss_point() is a ggplot geom, you can use it with ggplot2 features like faceting.
This means we can rapidly explore the missingness and stay within the familiar bounds of ggplot2.
# Explore the missingness in wind and air temperature, and display the
# missingness using `geom_miss_point()`. Facet by year to explore this further.
ggplot(oceanbuoys,
aes(x = wind_ew,
y = air_temp_c)) +
geom_miss_point() +
facet_wrap(~year)
# Explore the missingness in humidity and air temperature, and display the
# missingness using `geom_miss_point()`. Facet by year to explore this further.
ggplot(oceanbuoys,
aes(x = humidity,
y = air_temp_c)) +
geom_miss_point() +
facet_wrap(~year)
Now you can use geom_miss_point() to explore missing values in a scatter plot, and can use faceting to expand and explore further.
Another useful technique with geom_miss_point() is to explore the missingness by creating multiple plots.
Just as we have done in the previous exercises, we can use the nabular data to help us create additional faceted plots.
We can even create multiple faceted plots according to values in the data, such as year, and features of the data, such as missingness.
# Use geom_miss_point() and facet_wrap to explore how the missingness
# in wind_ew and air_temp_c is different for missingness of humidity
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = wind_ew,
y = air_temp_c)) +
geom_miss_point() +
facet_wrap(~humidity_NA)
# Use geom_miss_point() and facet_grid to explore how the missingness in wind_ew and air_temp_c
# is different for missingness of humidity AND by year - by using `facet_grid(humidity_NA ~ year)`
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = wind_ew,
y = air_temp_c)) +
geom_miss_point() +
facet_grid(humidity_NA~year)
We want to keep track of the values we impute. If we don’t, it is very difficult to assess how good the imputed values are.
We are going to practice imputing data and recreate visualizations from the previous set of exercises by imputing values below the range of the data.
This is a very useful way to further explore missingness, and it also provides the framework for imputing missing values.
First, we impute the data below the range using impute_below_all(), and then visualize the data. Although we can now see where the missing values are, we still need some way to track them. The “track missing data” programming pattern can help with this.
# Impute the data below the range using `impute_below_all()`.
ocean_imp <- impute_below_all(oceanbuoys)
# Visualize the new missing values
ggplot(ocean_imp,
aes(x = wind_ew, y = air_temp_c)) +
geom_point()
# Impute and track data with `bind_shadow()`, `impute_below_all()`, and `add_label_shadow()`
ocean_imp_track <- bind_shadow(oceanbuoys) %>%
impute_below_all() %>%
add_label_shadow()
# Look at the imputed values
ocean_imp_track
## # A tibble: 736 × 17
## year latit…¹ longi…² sea_t…³ air_t…⁴ humid…⁵ wind_ew wind_ns year_NA latit…⁶
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40 !NA !NA
## 2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30 !NA !NA
## 3 1997 0 -110 27.6 27 76.5 -5.10 4.5 !NA !NA
## 4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5 !NA !NA
## 5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10 !NA !NA
## 6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60 !NA !NA
## 7 1997 0 -110 28.0 27.0 76.5 -2 3.5 !NA !NA
## 8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5 !NA !NA
## 9 1997 0 -110 28.0 27.2 78.6 -4.20 5 !NA !NA
## 10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5 !NA !NA
## # … with 726 more rows, 7 more variables: longitude_NA <fct>,
## # sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>,
## # wind_ew_NA <fct>, wind_ns_NA <fct>, any_missing <chr>, and abbreviated
## # variable names ¹latitude, ²longitude, ³sea_temp_c, ⁴air_temp_c, ⁵humidity,
## # ⁶latitude_NA
Now, let’s recreate one of the earlier plots that used geom_miss_point().
To do this, we need to impute the data below the range of the data. This special kind of imputation exists only to explore the data, and it illustrates what we need to practice: tracking missing values. To impute the data below the range of the data, we use the function impute_below_all().
# Impute and track the missing values
ocean_imp_track <- bind_shadow(oceanbuoys) %>%
impute_below_all() %>%
add_label_shadow()
# Visualize the missingness in wind and air temperature,
# coloring missing air temp values with air_temp_c_NA
ggplot(ocean_imp_track,
aes(x = wind_ew, y = air_temp_c, color = air_temp_c_NA)) +
geom_point()
# Visualize humidity and air temp, coloring any missing cases using the variable any_missing
ggplot(ocean_imp_track,
aes(x = humidity, y = air_temp_c, color = any_missing)) +
geom_point()
Now that we can recreate the first visualization of geom_miss_point(), let’s explore how we can apply this to other exploratory tasks.
One useful task is to evaluate the number of missing values in a given variable using a histogram. We can do this with the ocean_imp_track dataset we created in the last exercise, which is loaded into this session.
# Explore the values of air_temp_c, visualizing the number of missing values with `air_temp_c_NA`.
p <- ggplot(ocean_imp_track, aes(x = air_temp_c, fill = air_temp_c_NA)) + geom_histogram()
# Explore the missing values in humidity using `humidity_NA`
p2 <- ggplot(ocean_imp_track, aes(x = humidity, fill = humidity_NA)) + geom_histogram()
# Explore the missing values in air_temp_c according to year, using `facet_wrap(~year)`.
p + facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Explore the missing values in humidity according to year, using `facet_wrap(~year)`.
p2 + facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In order to evaluate imputations, it helps to know what a bad one looks like. To explore this, let’s look at a typically bad imputation method: imputing with the mean value.
In this exercise we are going to explore how the mean imputation method works using a box plot, using the oceanbuoys dataset.
# Impute the mean value and track the imputations
ocean_imp_mean <- bind_shadow(oceanbuoys) %>%
impute_mean_all() %>%
add_label_shadow()
# Explore the mean values in humidity in the imputed dataset
ggplot(ocean_imp_mean,
aes(x = humidity_NA, y = humidity)) +
geom_boxplot()
# Explore the values in air temperature in the imputed dataset
ggplot(ocean_imp_mean,
aes(x = air_temp_c_NA, y = air_temp_c)) +
geom_boxplot()
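The box plots hint at the problem, but we can also quantify it: mean imputation leaves the imputed group’s mean equal to the observed mean while collapsing its spread to zero, since every imputed value is identical. A minimal sketch (re-creating ocean_imp_mean so it is self-contained; the name `humidity_spread` is just for illustration):

```r
library(dplyr)
library(naniar)

# Mean-impute and track missingness, as above
ocean_imp_mean <- bind_shadow(oceanbuoys) %>%
  impute_mean_all() %>%
  add_label_shadow()

# Spread of observed vs imputed humidity values:
# every imputed value equals the observed mean, so the imputed group's sd is 0
humidity_spread <- ocean_imp_mean %>%
  group_by(humidity_NA) %>%
  summarize(humidity_mean = mean(humidity),
            humidity_sd = sd(humidity))
humidity_spread
```

This artificial shrinkage of variance is exactly why mean imputation can lead to poor estimates and decisions.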
### Evaluating imputations: The scale
While the mean imputation might not look so bad when we compare it using a box plot, it is important to get a sense of the variation in the data. This is why it is important to explore how the scale and spread of the imputed values change compared to the rest of the data.
One way to evaluate the scale of the imputations is to use a scatter plot to see whether the imputed values fall in a plausible range.
# Explore imputations in air temperature and humidity,
# coloring by the variable, any_missing
ggplot(ocean_imp_mean,
aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point()
# Explore imputations in air temperature and humidity,
# coloring by the variable, any_missing, and faceting by year
ggplot(ocean_imp_mean,
aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point() +
facet_wrap(~year)
### Evaluating imputations: Across many variables
So far, we have covered ways to look at individual variables or pairs of variables and their imputed values. However, sometimes you want to look at imputations for many variables. To do this, you need to perform some data munging and re-arranging, which can get a little hairy with nabular data. The function shadow_long() gets the data into the right shape for these kinds of visualizations.
# Gather the imputed data
ocean_imp_mean_gather <- shadow_long(ocean_imp_mean,
humidity,
air_temp_c)
# Inspect the data
ocean_imp_mean_gather
## # A tibble: 1,472 × 4
## variable value variable_NA value_NA
## <chr> <chr> <chr> <chr>
## 1 air_temp_c 27.14999962 air_temp_c_NA !NA
## 2 air_temp_c 27.02000046 air_temp_c_NA !NA
## 3 air_temp_c 27 air_temp_c_NA !NA
## 4 air_temp_c 26.93000031 air_temp_c_NA !NA
## 5 air_temp_c 26.84000015 air_temp_c_NA !NA
## 6 air_temp_c 26.94000053 air_temp_c_NA !NA
## 7 air_temp_c 27.04000092 air_temp_c_NA !NA
## 8 air_temp_c 27.11000061 air_temp_c_NA !NA
## 9 air_temp_c 27.20999908 air_temp_c_NA !NA
## 10 air_temp_c 27.25 air_temp_c_NA !NA
## # … with 1,462 more rows
# shadow_long() returns `value` as character, so convert it to numeric
ocean_imp_mean_gather <- ocean_imp_mean_gather %>%
  mutate(value = as.numeric(value))
# Explore the imputations in a histogram
ggplot(ocean_imp_mean_gather,
aes(x = value, fill = value_NA)) +
geom_histogram() +
facet_wrap(~variable)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
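One thing to watch in the faceted histogram above: humidity and air temperature sit on very different scales, so a shared x-axis can squash one panel. A sketch using ggplot2’s `scales = "free_x"` option so each panel gets its own x range (rebuilding the long data so the sketch is self-contained):

```r
library(dplyr)
library(ggplot2)
library(naniar)

# Rebuild the mean-imputed data in long form, as above
ocean_imp_mean_gather <- bind_shadow(oceanbuoys) %>%
  impute_mean_all() %>%
  shadow_long(humidity, air_temp_c) %>%
  mutate(value = as.numeric(value))  # shadow_long() returns value as character

# Let each variable's panel use its own x-axis range
p <- ggplot(ocean_imp_mean_gather,
            aes(x = value, fill = value_NA)) +
  geom_histogram() +
  facet_wrap(~variable, scales = "free_x")
p
```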
There are many imputation packages in R. We are going to focus on the simputation package, which provides a simple, powerful interface for performing imputations.
Building a good imputation model is super important, but it is a complex topic - there is as much to building a good imputation model as there is to building a good statistical model. In this course, we are going to focus on how to evaluate imputations.
First, we are going to look at the impute_lm() function, which imputes values according to a specified linear model.
In this exercise, we are going to apply the previous assessment techniques to data imputed with impute_lm(), and then build upon this imputation method in subsequent lessons.
library(simputation)
##
## Attaching package: 'simputation'
## The following object is masked from 'package:naniar':
##
## impute_median
# Impute humidity and air temperature using wind_ew and wind_ns, and track missing values
ocean_imp_lm_wind <- oceanbuoys %>%
bind_shadow() %>%
impute_lm(air_temp_c ~ wind_ew + wind_ns) %>%
impute_lm(humidity ~ wind_ew + wind_ns) %>%
add_label_shadow()
# Plot the imputed values for air_temp_c and humidity, colored by missingness
ggplot(ocean_imp_lm_wind,
aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point()
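As a side note, simputation’s formula interface accepts several variables on the left-hand side separated by +, so the two impute_lm() calls above can be written as one. A sketch, assuming the same predictors suit both outcomes (the name `ocean_imp_lm_wind2` is just for illustration; since wind_ew and wind_ns are fully observed here, the result matches the chained version):

```r
library(dplyr)
library(naniar)
library(simputation)

# One call imputing both outcomes with the same predictors
ocean_imp_lm_wind2 <- oceanbuoys %>%
  bind_shadow() %>%
  impute_lm(air_temp_c + humidity ~ wind_ew + wind_ns) %>%
  add_label_shadow()
```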
### Evaluating and comparing imputations
When you build up an imputation model, it’s a good idea to compare it to another method. In this lesson, we are going to compare the previously imputed dataset created using impute_lm() to the mean imputed dataset. These datasets are included in this exercise as ocean_imp_lm_wind and ocean_imp_mean, respectively.
# Bind the models together
bound_models <- bind_rows(mean = ocean_imp_mean,
lm_wind = ocean_imp_lm_wind,
.id = "imp_model")
# Inspect the values of air_temp and humidity as a scatter plot
ggplot(bound_models,
aes(x = air_temp_c,
y = humidity,
color = any_missing)) +
geom_point() +
facet_wrap(~imp_model)
### Evaluating imputations (many models & variables)
When you build up an imputation model, it’s a good idea to compare it to another method.
In this lesson, we are going to get you to add a final imputation model that contains an extra useful piece of information that helps explain some of the variation in the data. You are then going to compare the values, as previously done in the last lesson.
# Build a model adding year to the outcome
ocean_imp_lm_wind_year <- bind_shadow(oceanbuoys) %>%
impute_lm(air_temp_c ~ wind_ew + wind_ns + year) %>%
impute_lm(humidity ~ wind_ew + wind_ns + year) %>%
add_label_shadow()
# Bind the mean, lm_wind, and lm_wind_year models together
bound_models <- bind_rows(mean = ocean_imp_mean,
lm_wind = ocean_imp_lm_wind,
lm_wind_year = ocean_imp_lm_wind_year,
.id = "imp_model")
# Explore air_temp and humidity, coloring by any missings, and faceting by imputation model
ggplot(bound_models, aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point() + facet_wrap(~imp_model)
To evaluate the different imputation methods, we need to put them into a single data frame. Next, you will compare three different approaches to handling missing data in the oceanbuoys dataset: a complete-cases dataset, ocean_cc; the previously created ocean_imp_lm_wind; and a third imputed dataset, ocean_imp_lm_all, which you will create using a linear model that imputes the variables sea_temp_c, air_temp_c, and humidity using the variables wind_ew, wind_ns, year, latitude, and longitude.
You will then bind all of the datasets together (ocean_cc, ocean_imp_lm_wind, and ocean_imp_lm_all), calling the result bound_models.
# Create a complete-cases dataset
ocean_cc <- oceanbuoys %>%
na.omit() %>%
bind_shadow() %>%
add_label_shadow()
# Create an imputed dataset using linear models
ocean_imp_lm_all <- bind_shadow(oceanbuoys) %>%
add_label_shadow() %>%
impute_lm(sea_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
impute_lm(air_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
impute_lm(humidity ~ wind_ew + wind_ns + year + latitude + longitude)
# Bind the datasets
bound_models <- bind_rows(cc = ocean_cc,
imp_lm_wind = ocean_imp_lm_wind,
imp_lm_all = ocean_imp_lm_all,
.id = "imp_model")
# Look at the models
bound_models
## # A tibble: 2,037 × 18
## imp_m…¹ year latit…² longi…³ sea_t…⁴ air_t…⁵ humid…⁶ wind_ew wind_ns year_NA
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 cc 1997 0 -110 27.6 27.1 79.6 -6.40 5.40 !NA
## 2 cc 1997 0 -110 27.5 27.0 75.8 -5.30 5.30 !NA
## 3 cc 1997 0 -110 27.6 27 76.5 -5.10 4.5 !NA
## 4 cc 1997 0 -110 27.6 26.9 76.2 -4.90 2.5 !NA
## 5 cc 1997 0 -110 27.6 26.8 76.4 -3.5 4.10 !NA
## 6 cc 1997 0 -110 27.8 26.9 76.7 -4.40 1.60 !NA
## 7 cc 1997 0 -110 28.0 27.0 76.5 -2 3.5 !NA
## 8 cc 1997 0 -110 28.0 27.1 78.3 -3.70 4.5 !NA
## 9 cc 1997 0 -110 28.0 27.2 78.6 -4.20 5 !NA
## 10 cc 1997 0 -110 28.0 27.2 76.9 -3.60 3.5 !NA
## # … with 2,027 more rows, 8 more variables: latitude_NA <fct>,
## # longitude_NA <fct>, sea_temp_c_NA <fct>, air_temp_c_NA <fct>,
## # humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>, any_missing <chr>,
## # and abbreviated variable names ¹imp_model, ²latitude, ³longitude,
## # ⁴sea_temp_c, ⁵air_temp_c, ⁶humidity
We are imputing our data for a reason - we want to analyze the data!
In this example, we are interested in predicting sea temperature, so we will build a linear model predicting sea temperature.
We will fit this model to each of the datasets we created and then explore the coefficients in the data.
The objects from the previous lesson (ocean_cc, ocean_imp_lm_wind, ocean_imp_lm_all, and bound_models) are loaded into the workspace.
# Create the model summary for each dataset
model_summary <- bound_models %>%
group_by(imp_model) %>%
nest() %>%
mutate(mod = purrr::map(data, ~lm(sea_temp_c ~ air_temp_c + humidity + year, data = .)),
res = purrr::map(mod, residuals),
pred = purrr::map(mod, predict),
tidy = purrr::map(mod, broom::tidy))
# Explore the coefficients in the model
model_summary %>%
select(imp_model, tidy) %>%
unnest(cols = c(tidy))
## # A tibble: 12 × 6
## # Groups: imp_model [3]
## imp_model term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 cc (Intercept) -735. 45.9 -16.0 8.19e- 48
## 2 cc air_temp_c 0.864 0.0231 37.4 2.64e-154
## 3 cc humidity 0.0341 0.00390 8.74 2.69e- 17
## 4 cc year 0.369 0.0232 15.9 3.46e- 47
## 5 imp_lm_wind (Intercept) -1742. 56.1 -31.0 1.83e-135
## 6 imp_lm_wind air_temp_c 0.365 0.0279 13.1 2.73e- 35
## 7 imp_lm_wind humidity 0.0225 0.00690 3.26 1.17e- 3
## 8 imp_lm_wind year 0.880 0.0283 31.1 6.79e-136
## 9 imp_lm_all (Intercept) -697. 51.8 -13.5 5.04e- 37
## 10 imp_lm_all air_temp_c 0.890 0.0255 35.0 2.90e-158
## 11 imp_lm_all humidity 0.0127 0.00463 2.75 6.03e- 3
## 12 imp_lm_all year 0.351 0.0262 13.4 1.12e- 36
Comparing the coefficients, the model imputed using all available predictors (imp_lm_all) gives estimates closest to the complete-cases model, making it the best of the three approaches.
best_model <- "imp_lm_all"